83 research outputs found
Constructing ensembles for intrinsically disordered proteins
The relatively flat energy landscapes associated with intrinsically disordered proteins make modeling these systems especially problematic. A comprehensive model for these proteins requires one to build an ensemble consisting of a finite collection of structures, and their corresponding relative stabilities, which adequately capture the range of accessible states of the protein. In this regard, methods that use computational techniques to interpret experimental data in terms of such ensembles are an essential part of the modeling process. In this review, we critically assess the advantages and limitations of current techniques and discuss new methods for the validation of these ensembles.
Deep Metric Learning for the Hemodynamics Inference with Electrocardiogram Signals
Heart failure is a debilitating condition that affects millions of people
worldwide and has a significant impact on their quality of life and mortality
rates. An objective assessment of cardiac pressures remains an important method
for the diagnosis and treatment prognostication for patients with heart
failure. Although cardiac catheterization is the gold standard for estimating
central hemodynamic pressures, it is an invasive procedure that carries
inherent risks, making it a potentially dangerous procedure for some patients.
Approaches that leverage non-invasive signals - such as electrocardiogram (ECG)
- have the promise to make the routine estimation of cardiac pressures feasible
in both inpatient and outpatient settings. Prior models trained to estimate
intracardiac pressures (e.g., mean pulmonary capillary wedge pressure (mPCWP))
in a supervised fashion have shown good discriminatory ability but have been
limited to the labeled dataset from the heart failure cohort. To address this
issue and build a robust representation, we apply deep metric learning (DML)
and propose a novel self-supervised DML with distance-based mining that
improves the performance of a model with limited labels. We use a dataset that
contains over 5.4 million ECGs without concomitant central pressure labels to
pre-train a self-supervised DML model which showed improved classification of
elevated mPCWP compared to self-supervised contrastive baselines. Additionally,
the supervised DML model that uses ECGs with access to 8,172 mPCWP labels
demonstrated significantly better performance on the mPCWP regression task
compared to the supervised baseline. Moreover, our data suggest that DML yields
models that are performant across patient subgroups, even when some patient
subgroups are under-represented in the dataset. Our code is available at
https://github.com/mandiehyewon/ssldm
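As a rough illustration of deep metric learning with distance-based mining, the sketch below implements batch-hard triplet mining followed by a margin-based triplet loss. This is a standard DML technique swapped in for illustration, not the authors' method: the abstract does not specify the mining rule, and the labels used here (for example, elevated versus non-elevated mPCWP) are an assumption, whereas the self-supervised variant described above works without such labels. The function names and the `margin` parameter are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mine_triplets(embeddings, labels):
    """Batch-hard mining: for each anchor, pick the farthest same-label
    sample (hardest positive) and the closest other-label sample
    (hardest negative). Anchors lacking either are skipped."""
    triplets = []
    for i, anchor in enumerate(embeddings):
        positives = [j for j, lab in enumerate(labels)
                     if lab == labels[i] and j != i]
        negatives = [j for j, lab in enumerate(labels) if lab != labels[i]]
        if not positives or not negatives:
            continue
        hardest_pos = max(positives,
                          key=lambda j: euclidean(anchor, embeddings[j]))
        hardest_neg = min(negatives,
                          key=lambda j: euclidean(anchor, embeddings[j]))
        triplets.append((i, hardest_pos, hardest_neg))
    return triplets


def triplet_loss(embeddings, triplets, margin=1.0):
    """Mean hinge loss max(0, d(a, p) - d(a, n) + margin) over triplets."""
    if not triplets:
        return 0.0
    losses = [
        max(0.0,
            euclidean(embeddings[a], embeddings[p])
            - euclidean(embeddings[a], embeddings[n])
            + margin)
        for a, p, n in triplets
    ]
    return sum(losses) / len(losses)
```

In a real pipeline the embeddings would come from an ECG encoder and the loss would be minimized by gradient descent; here plain Python lists stand in for both.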
A Structure-free Method for Quantifying Conformational Flexibility in proteins
All proteins sample a range of conformations at physiologic temperatures and this inherent flexibility enables them to carry out their prescribed functions. A comprehensive understanding of protein function therefore entails a characterization of protein flexibility. Here we describe a novel approach for quantifying a protein’s flexibility in solution using small-angle X-ray scattering (SAXS) data. The method calculates an effective entropy that quantifies the diversity of radii of gyration that a protein can adopt in solution and does not require the explicit generation of structural ensembles to garner insights into protein flexibility. Application of this structure-free approach to over 200 experimental datasets demonstrates that the methodology can quantify a protein’s disorder as well as the effects of ligand binding on protein flexibility. Such quantitative descriptions of protein flexibility form the basis of a rigorous taxonomy for the description and classification of protein structure.
Massachusetts Institute of Technology (Steve G. and Renee Finn Faculty Innovation Fellowship)
Swiss National Science Foundation (Early Postdoc.Mobility Fellowship
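To make the idea of an entropy over radii of gyration concrete, the sketch below computes the ordinary Shannon entropy of a discrete distribution over radius-of-gyration bins. This is only an assumption about the general flavor of such a measure: the abstract does not define the effective entropy or say how the distribution is obtained from SAXS data, and the function name and example numbers are hypothetical.

```python
import math


def shannon_entropy(weights):
    """Shannon entropy (in nats) of a discrete distribution.

    `weights` are non-negative bin weights, e.g. the relative population
    of each radius-of-gyration bin; they are normalized internally.
    """
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    entropy = 0.0
    for w in weights:
        p = w / total
        if p > 0:  # empty bins contribute nothing
            entropy -= p * math.log(p)
    return entropy


# Made-up distributions over five Rg bins, for illustration only.
compact_protein = [0.02, 0.94, 0.02, 0.01, 0.01]     # one dominant size
disordered_protein = [0.15, 0.25, 0.25, 0.20, 0.15]  # many sizes populated

print(shannon_entropy(compact_protein))
print(shannon_entropy(disordered_protein))
```

Under this reading, a protein that populates a single narrow range of sizes scores near zero, while one that spreads its population across many sizes scores closer to the logarithm of the number of bins.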
Comparative Studies of Disordered Proteins with Similar Sequences: Application to Aβ40 and Aβ42
Quantitative comparisons of intrinsically disordered proteins (IDPs) with similar sequences, such as mutant forms of the same protein, may provide insights into IDP aggregation—a process that plays a role in several neurodegenerative disorders. Here we describe an approach for modeling IDPs with similar sequences that simplifies the comparison of the ensembles by utilizing a single library of structures. The relative population weights of the structures are estimated using a Bayesian formalism, which provides measures of uncertainty in the resulting ensembles. We applied this approach to the comparison of ensembles for Aβ40 and Aβ42. Bayesian hypothesis testing finds that although both Aβ species sample β-rich conformations in solution that may represent prefibrillar intermediates, the probability that Aβ42 samples these prefibrillar states is roughly an order of magnitude larger than the probability that Aβ40 samples such structures. Moreover, the structure of the soluble prefibrillar state in our ensembles is similar to the experimentally determined structure of Aβ that has been implicated as an intermediate in the aggregation pathway. Overall, our approach for comparative studies of IDPs with similar sequences provides a platform for future studies on the effect of mutations on the structure and function of disordered proteins.
Intrinsically Disordered Proteins: Where Computation Meets Experiment
Proteins are heteropolymers that play important roles in virtually every biological reaction. While many proteins have well-defined three-dimensional structures that are inextricably coupled to their function, intrinsically disordered proteins (IDPs) do not have a well-defined structure, and it is this lack of structure that facilitates their function. As many IDPs are involved in essential cellular processes, various diseases have been linked to their malfunction, thereby making them important drug targets. In this review we discuss methods for studying IDPs and provide examples of how computational methods can improve our understanding of IDPs. We focus on two intensely studied IDPs that have been implicated in very different pathologic pathways. The first, p53, has been linked to over 50% of human cancers, and the second, Amyloid-β (Aβ), forms neurotoxic aggregates in the brains of patients with Alzheimer’s disease. We use these representative proteins to illustrate some of the challenges associated with studying IDPs and demonstrate how computational tools can be fruitfully applied to arrive at a more comprehensive understanding of these fascinating heteropolymers.
National Science Foundation (U.S.). Directorate for Biological Sciences. Postdoctoral Research Fellowship (Grant 1309247
Sequential Multi-Dimensional Self-Supervised Learning for Clinical Time Series
Self-supervised learning (SSL) for clinical time series data has received
significant attention in recent literature, since these data are highly rich
and provide important information about a patient's physiological state.
However, most existing SSL methods for clinical time series are limited in that
they are designed for unimodal time series, such as a sequence of structured
features (e.g., lab values and vital signs) or an individual high-dimensional
physiological signal (e.g., an electrocardiogram). These existing methods
cannot be readily extended to model time series that exhibit multimodality,
with structured features and high-dimensional data being recorded at each
timestep in the sequence. In this work, we address this gap and propose a new
SSL method -- Sequential Multi-Dimensional SSL -- where an SSL loss is applied
both at the level of the entire sequence and at the level of the individual
high-dimensional data points in the sequence in order to better capture
information at both scales. Our strategy is agnostic to the specific form of
loss function used at each level -- it can be contrastive, as in SimCLR, or
non-contrastive, as in VICReg. We evaluate our method on two real-world
clinical datasets, where the time series contains sequences of (1)
high-frequency electrocardiograms and (2) structured data from lab values and
vital signs. Our experimental results indicate that pre-training with our
method and then fine-tuning on downstream tasks improves performance over
baselines on both datasets, and in several settings, can lead to improvements
across different self-supervised loss functions.
Comment: ICML 202
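The idea of applying an SSL loss at two levels can be sketched as a sum of two contrastive terms, one over sequence-level embeddings and one over embeddings of the individual high-dimensional data points. The sketch below uses a simplified one-directional InfoNCE loss, which is related to but not the same as SimCLR's NT-Xent. The function names, the `temperature` default, and the `signal_weight` balancing factor are hypothetical and are not taken from the abstract.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(view_a, view_b, temperature=0.5):
    """One-directional InfoNCE: row i of view_a should be most similar
    to row i of view_b among all rows of view_b."""
    n = len(view_a)
    total = 0.0
    for i in range(n):
        logits = [cosine(view_a[i], view_b[j]) / temperature
                  for j in range(n)]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(
            sum(math.exp(v - peak) for v in logits))
        total += log_denominator - logits[i]
    return total / n


def two_level_loss(seq_a, seq_b, signal_a, signal_b, signal_weight=1.0):
    """Sequence-level loss plus a weighted loss over the individual
    high-dimensional data points (e.g. single ECG embeddings)."""
    return (info_nce(seq_a, seq_b)
            + signal_weight * info_nce(signal_a, signal_b))
```

Because the abstract describes the strategy as agnostic to the loss at each level, `info_nce` could in principle be replaced by a non-contrastive objective without changing `two_level_loss`.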
ECG Morphological Variability in Beat Space for Risk Stratification After Acute Coronary Syndrome
Background: Identification of patients who are at high risk of adverse cardiovascular events after an acute coronary syndrome (ACS) remains a major challenge in clinical cardiology. We hypothesized that quantifying variability in electrocardiogram (ECG) morphology may improve risk stratification post‐ACS. Methods and Results: We developed a new metric to quantify beat‐to‐beat morphologic changes in the ECG: morphologic variability in beat space (MVB), and compared our metric to published ECG metrics (heart rate variability [HRV], deceleration capacity [DC], T‐wave alternans, heart rate turbulence, and severe autonomic failure). We tested the ability of these metrics to identify patients at high risk of cardiovascular death (CVD) using 1082 patients (1‐year CVD rate, 4.5%) from the MERLIN‐TIMI 36 (Metabolic Efficiency with Ranolazine for Less Ischemia in Non‐ST‐Elevation Acute Coronary Syndrome—Thrombolysis in Myocardial Infarction 36) clinical trial. DC, HRV/low frequency–high frequency, and MVB were all associated with CVD (hazard ratios [HRs] from 2.1 to 2.3 [P<0.05 for all] after adjusting for the TIMI risk score [TRS], left ventricular ejection fraction [LVEF], and B‐type natriuretic peptide [BNP]). In a cohort with low‐to‐moderate TRS (N=864; 1‐year CVD rate, 2.7%), only MVB was significantly associated with CVD (HR, 3.0; P=0.01, after adjusting for LVEF and BNP). Conclusions: ECG morphological variability in beat space contains prognostic information complementary to the clinical variables, LVEF and BNP, in patients with low‐to‐moderate TRS. ECG metrics could help to risk stratify patients who might not otherwise be considered at high risk of CVD post‐ACS
Motif Discovery in Physiological Datasets: A Methodology for Inferring Predictive Elements
In this article, we propose a methodology for identifying predictive physiological patterns in the absence of prior knowledge. We use the principle of conservation to identify activity that consistently precedes an outcome in patients, and describe a two-stage process that allows us to efficiently search for such patterns in large datasets. This involves first transforming continuous physiological signals from patients into symbolic sequences, and then searching for patterns in these reduced representations that are strongly associated with an outcome.
Our strategy of identifying conserved activity that is unlikely to have occurred purely by chance in symbolic data is analogous to the discovery of regulatory motifs in genomic datasets. We build upon existing work in this area, generalizing the notion of a regulatory motif and enhancing current techniques to operate robustly on non-genomic data. We also address two significant considerations associated with motif discovery in general: computational efficiency and robustness in the presence of degeneracy and noise. To deal with these issues, we introduce the concept of active regions and new subset-based techniques such as a two-layer Gibbs sampling algorithm. These extensions allow for a framework for information inference, where precursors are identified as approximately conserved activity of arbitrary complexity preceding multiple occurrences of an event.
We evaluated our solution on a population of patients who experienced sudden cardiac death and attempted to discover electrocardiographic activity that may be associated with the endpoint of death. To assess the predictive patterns discovered, we compared likelihood scores for motifs in the sudden death population against control populations of normal individuals and those with non-fatal supraventricular arrhythmias. Our results suggest that predictive motif discovery may be able to identify clinically relevant information even in the absence of significant prior knowledge.
CIMIT: Center for Integration of Medicine and Innovative Technology
Harvard University--MIT Division of Health Sciences and Technolog
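The two-stage process described in this abstract (symbolize the signal, then search the symbolic sequences for conserved patterns) can be sketched in a few lines. This is a deliberately simplified illustration: it uses fixed thresholds and exact k-mer matching, whereas the abstract describes approximate matching via active regions and a two-layer Gibbs sampler, neither of which is implemented here. All function names, thresholds, and the alphabet are hypothetical.

```python
def symbolize(signal, thresholds, alphabet="abc"):
    """Stage 1: map each sample to a symbol according to how many of the
    ascending thresholds it meets or exceeds."""
    if len(alphabet) != len(thresholds) + 1:
        raise ValueError("need exactly one more symbol than thresholds")
    return "".join(
        alphabet[sum(value >= t for t in thresholds)] for value in signal
    )


def kmers(sequence, k):
    """Set of all length-k substrings of a symbolic sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def conserved_patterns(case_sequences, control_sequences, k):
    """Stage 2: k-mers present in every case sequence and absent from
    every control sequence (exact matching only)."""
    if not case_sequences:
        return []
    shared = set.intersection(*(kmers(s, k) for s in case_sequences))
    for control in control_sequences:
        shared -= kmers(control, k)
    return sorted(shared)
```

For example, signals recorded before an event would be symbolized with `symbolize` and passed as `case_sequences`, with signals from control patients passed as `control_sequences`.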
Hidden States within Disordered Regions of the CcdA Antitoxin Protein
The bacterial toxin–antitoxin system CcdB–CcdA provides a mechanism for the control of cell death and quiescence. The antitoxin protein CcdA is a homodimer composed of two monomers that each contain a folded N-terminal region and an intrinsically disordered C-terminal arm. Binding of the intrinsically disordered C-terminal arm of CcdA to the toxin CcdB prevents CcdB from inhibiting DNA gyrase and thereby averts cell death. Accurate models of the unfolded state of the partially disordered CcdA antitoxin can therefore provide insight into general mechanisms whereby protein disorder regulates events that are crucial to cell survival. Previous structural studies were able to model only two of three distinct structural states, a closed state and an open state, that are adopted by the C-terminal arm of CcdA. Using a combination of free energy simulations, single-pair Förster resonance energy transfer experiments, and existing NMR data, we developed structural models for all three states of the protein. Contrary to prior studies, we find that CcdA samples a previously unknown state where only one of the disordered C-terminal arms makes extensive contacts with the folded N-terminal domain. Moreover, our data suggest that previously unobserved conformational states play a role in regulating antitoxin concentrations and the activity of CcdA’s cognate toxin. These data demonstrate that intrinsic disorder in CcdA provides a mechanism for regulating cell fate